conditional gradient
Heavy Ball Momentum for Conditional Gradient
Conditional gradient, aka Frank Wolfe (FW) algorithms, have well-documented merits in machine learning and signal processing applications. Unlike projection-based methods, momentum cannot improve the convergence rate of FW, in general. This limitation motivates the present work, which deals with heavy ball momentum, and its impact to FW. Specifically, it is established that heavy ball offers a unifying perspective on the primal-dual (PD) convergence, and enjoys a tighter \textit{per iteration} PD error rate, for multiple choices of step sizes, where PD error can serve as the stopping criterion in practice. In addition, it is asserted that restart, a scheme typically employed jointly with Nesterov's momentum, can further tighten this PD error bound. Numerical results demonstrate the usefulness of heavy ball momentum in FW iterations.
Adaptive Conditional Gradient Descent
Khademi, Abbas, Silveti-Falls, Antonio
Selecting an effective step-size is a fundamental challenge in first-order optimization, especially for problems with non-Euclidean geometries. This paper presents a novel adaptive step-size strategy for optimization algorithms that rely on linear minimization oracles, as used in the Conditional Gradient or non-Euclidean Normalized Steepest Descent algorithms. Using a simple heuristic to estimate a local Lipschitz constant for the gradient, we can determine step-sizes that guarantee sufficient decrease at each iteration. More precisely, we establish convergence guarantees for our proposed Adaptive Conditional Gradient Descent algorithm, which covers as special cases both the classical Conditional Gradient algorithm and non-Euclidean Normalized Steepest Descent algorithms with adaptive step-sizes. Our analysis covers optimization of continuously differentiable functions in non-convex, quasar-convex, and strongly convex settings, achieving convergence rates that match state-of-the-art theoretical bounds. Comprehensive numerical experiments validate our theoretical findings and illustrate the practical effectiveness of Adaptive Conditional Gradient Descent. The results exhibit competitive performance, underscoring the potential of the adaptive step-size for applications.
Scalable DC Optimization via Adaptive Frank-Wolfe Algorithms
We consider the problem of minimizing a difference of (smooth) convex functions over a compact convex feasible region $P$, i.e., $\min_{x \in P} f(x) - g(x)$, with smooth $f$ and Lipschitz continuous $g$. This computational study builds upon and complements the framework of Maskan et al. [2025] by integrating advanced Frank-Wolfe variants to reduce computational overhead. We empirically show that constrained DC problems can be efficiently solved using a combination of the Blended Pairwise Conditional Gradients (BPCG) algorithm [Tsuji et al., 2022] with warm-starting and the adaptive error bound from Maskan et al. [2025]. The result is a highly efficient and scalable projection-free algorithm for constrained DC optimization.
- Europe > France > Bourgogne-Franche-Comté > Doubs > Besançon (0.05)
- Europe > Spain > Aragón (0.04)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- (3 more...)
Blended Conditional Gradients: the unconditioning of conditional gradients
Braun, Gábor, Pokutta, Sebastian, Tu, Dan, Wright, Stephen
We present a blended conditional gradient approach for minimizing a smooth convex function over a polytope P, combining the Frank--Wolfe algorithm (also called conditional gradient) with gradient-based steps, different from away steps and pairwise steps, but still achieving linear convergence for strongly convex functions, along with good practical performance. Our approach retains all favorable properties of conditional gradient algorithms, notably avoidance of projections onto P and maintenance of iterates as sparse convex combinations of a limited number of extreme points of P. The algorithm is lazy, making use of inexpensive inexact solutions of the linear programming subproblem that characterizes the conditional gradient approach. It decreases measures of optimality (primal and dual gaps) rapidly, both in the number of iterations and in wall-clock time, outperforming even the lazy conditional gradient algorithms of [arXiv:1410.8816]. We also present a streamlined version of the algorithm for the probability simplex.
- North America > United States > Wisconsin > Dane County > Madison (0.14)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- Europe > Russia (0.04)
- Asia > Russia (0.04)
Heavy Ball Momentum for Conditional Gradient
Conditional gradient, aka Frank Wolfe (FW) algorithms, have well-documented merits in machine learning and signal processing applications. Unlike projection-based methods, momentum cannot improve the convergence rate of FW, in general. This limitation motivates the present work, which deals with heavy ball momentum, and its impact to FW. Specifically, it is established that heavy ball offers a unifying perspective on the primal-dual (PD) convergence, and enjoys a tighter \textit{per iteration} PD error rate, for multiple choices of step sizes, where PD error can serve as the stopping criterion in practice. In addition, it is asserted that restart, a scheme typically employed jointly with Nesterov's momentum, can further tighten this PD error bound.
Stochastic Constrained Decentralized Optimization for Machine Learning with Fewer Data Oracles: a Gradient Sliding Approach
Nguyen, Hoang Huy, Li, Yan, Zhao, Tuo
In modern decentralized applications, ensuring communication efficiency and privacy for the users are the key challenges. In order to train machine-learning models, the algorithm has to communicate to the data center and sample data for its gradient computation, thus exposing the data and increasing the communication cost. This gives rise to the need for a decentralized optimization algorithm that is communication-efficient and minimizes the number of gradient computations. To this end, we propose the primal-dual sliding with conditional gradient sliding framework, which is communication-efficient and achieves an $\varepsilon$-approximate solution with the optimal gradient complexity of $O(1/\sqrt{\varepsilon}+\sigma^2/{\varepsilon^2})$ and $O(\log(1/\varepsilon)+\sigma^2/\varepsilon)$ for the convex and strongly convex setting respectively and an LO (Linear Optimization) complexity of $O(1/\varepsilon^2)$ for both settings given a stochastic gradient oracle with variance $\sigma^2$. Compared with the prior work \cite{wai-fw-2017}, our framework relaxes the assumption of the optimal solution being a strict interior point of the feasible set and enjoys wider applicability for large-scale training using a stochastic gradient oracle. We also demonstrate the efficiency of our algorithms with various numerical experiments.
- North America > United States > Georgia > Fulton County > Atlanta (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- Asia > Japan > Honshū > Tōhoku > Fukushima Prefecture > Fukushima (0.04)
Projection-free Distributed Online Learning with Strongly Convex Losses
Wan, Yuanyu, Wang, Guanghui, Zhang, Lijun
To efficiently solve distributed online learning problems with complicated constraints, previous studies have proposed several distributed projection-free algorithms. The state-of-the-art one achieves the $O({T}^{3/4})$ regret bound with $O(\sqrt{T})$ communication complexity. In this paper, we further exploit the strong convexity of loss functions to improve the regret bound and communication complexity. Specifically, we first propose a distributed projection-free algorithm for strongly convex loss functions, which enjoys a better regret bound of $O(T^{2/3}\log T)$ with smaller communication complexity of $O(T^{1/3})$. Furthermore, we demonstrate that the regret of distributed online algorithms with $C$ communication rounds has a lower bound of $\Omega(T/C)$, even when the loss functions are strongly convex. This lower bound implies that the $O(T^{1/3})$ communication complexity of our algorithm is nearly optimal for obtaining the $O(T^{2/3}\log T)$ regret bound up to polylogarithmic factors. Finally, we extend our algorithm into the bandit setting and obtain similar theoretical guarantees.
- Asia > China > Jiangsu Province > Nanjing (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Israel > Jerusalem District > Jerusalem (0.04)
Robust Structured Statistical Estimation via Conditional Gradient Type Methods
Zhuo, Jiacheng, Liu, Liu, Caramanis, Constantine
Structured statistical estimation problems are often solved by Conditional Gradient (CG) type methods to avoid the computationally expensive projection operation. However, the existing CG type methods are not robust to data corruption. To address this, we propose to robustify CG type methods against Huber's corruption model and heavy-tailed data. First, we show that the two Pairwise CG methods are stable, i.e., do not accumulate error. Combined with robust mean gradient estimation techniques, we can therefore guarantee robustness to a wide class of problems, but now in a projection-free algorithmic framework. Next, we consider high dimensional problems. Robust mean estimation based approaches may have an unacceptably high sample complexity. When the constraint set is a $\ell_0$ norm ball, Iterative-Hard-Thresholding-based methods have been developed recently. Yet extension is non-trivial even for general sets with $O(d)$ extreme points. For setting where the feasible set has $O(\text{poly}(d))$ extreme points, we develop a novel robustness method, based on a new condition we call the Robust Atom Selection Condition (RASC). When RASC is satisfied, our method converges linearly with a corresponding statistical error, with sample complexity that scales correctly in the sparsity of the problem, rather than the ambient dimension as would be required by any approach based on robust mean estimation.
Inexact and Stochastic Generalized Conditional Gradient with Augmented Lagrangian and Proximal Step
Silveti-Falls, Antonio, Molinari, Cesare, Fadili, Jalal
In this paper we propose and analyze inexact and stochastic versions of the CGALP algorithm developed in the authors' previous paper, which we denote ICGALP, that allows for errors in the computation of several important quantities. In particular this allows one to compute some gradients, proximal terms, and/or linear minimization oracles in an inexact fashion that facilitates the practical application of the algorithm to computationally intensive settings, e.g. in high (or possibly infinite) dimensional Hilbert spaces commonly found in machine learning problems. The algorithm is able to solve composite minimization problems involving the sum of three convex proper lower-semicontinuous functions subject to an affine constraint of the form $Ax=b$ for some bounded linear operator $A$. Only one of the functions in the objective is assumed to be differentiable, the other two are assumed to have an accessible prox operator and a linear minimization oracle. As main results, we show convergence of the Lagrangian to an optimum and asymptotic feasibility of the affine constraint as well as weak convergence of the dual variable to a solution of the dual problem, all in an almost sure sense. Almost sure convergence rates, both pointwise and ergodic, are given for the Lagrangian values and the feasibility gap. Numerical experiments verifying the predicted rates of convergence are shown as well.
- Europe > France (0.04)
- North America > United States > New York (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Locally Accelerated Conditional Gradients
Carderera, Alejandro, Diakonikolas, Jelena, Pokutta, Sebastian
Conditional gradient methods form a class of projection-free first-order algorithms for solving smooth convex optimization problems. Apart from eschewing projections, these methods are attractive because of their simplicity, numerical performance, and the sparsity of the solutions outputted. However, they do not achieve optimal convergence rates. We present the Locally Accelerated Conditional Gradients algorithm that relaxes the projection-freeness requirement to only require projection onto (typically low-dimensional) simplices and mixes accelerated steps with conditional gradient steps to achieve local acceleration. We derive asymptotically optimal convergence rates for this algorithm. Our experimental results demonstrate the practicality of our approach; in particular, the speedup is achieved both in wall-clock time and per-iteration progress compared to standard conditional gradient methods and a Catalyst-accelerated Away-Step Frank-Wolfe algorithm.